Table of Contents:
Stevenson, WA, is a small town of about 1500 people in the Portland Metropolitan Region. Although the small city features various natural and cultural amenities, many of its residents, businesses, and organizations have expressed dissatisfaction with local internet services. In February 2018, Stevenson City Council directed city staff to pursue the creation of a strategic plan regarding broadband infrastructure.
Broadband is high-speed internet, or internet with minimum speeds of 25 megabits per second (mbps) downloading and 3 mbps uploading, according to the US Federal Communications Commission (FCC). One of our team members, Kimberly Pearson, when interning at the City of Stevenson as the Broadband Project Coordinator, created a survey to assess current internet quality and uptake for busiesses and organizations. After sending the survey to over 150 small and medium sized businesses within city limits, 40 businesses responded, a ~26% response rate. The 40 survey questions refer to the business’ chosen Internet Service Provider, the cost of their internet service, the speeds they subscribe to, their actual experienced speeds, their satisfaction with their internet plan, and more. The results were then visualized in Tableau (see plots below).
Within this report, we would like to explore the following research questions as they relate to each type of plot.
For our univariate plots, we wanted to examine the range of download and upload speeds offered by each Internet Service Provider (ISP) answer the following questions:
“Which ISP had the highest range of download speeds?”
“Which ISP had the highest range of upload speeds?”
We utilized two different box plots to answer these questions.
For our bivariate plots, we wanted to examine the “advertised download speed” of each ISP and the recorded “download speed” of each ISP to answer the following questions:
“Which ISP most accurately advertised their download speeds?”
“Which ISP most accurately advertised their upload speeds?”
We utilized two scatterplots to answer these questions.
For our first multivariate plot, we wanted to examine the relationship between “subscribed speed”, “download speed”, “upload speed”, and “satisfaction”. We specifically wanted to answer the following question:
“Does subscribed speed, download speed, and upload speed have a positive or negative effect on satisfaction, and what is the magnitude of that effect?”
We utilized a dot and whisker plot to answer this question.
For our second multivariate plot, we wanted to examine the relationship between “download speeds” and “satisfaction” by each ISP. We specifically wanted to answer the following question:
“Are higher download and upload speeds associated with higher levels of satisfaction?”
We used a heat plot to answer this question.
Our univariate plots are boxplots, which are…
We can create the plots using the following steps. Additionally, notes for each of the steps are included in the code.
# Here we load the dataset.
link='https://github.com/marcevl/Team-Assignments/raw/master/Internet.RData'
load(file=url(link))
# Then, we call the Ggplot2 library. If Ggplot2 has not been installed yet, go to the bottom right quadrant of your screen, click install, then select ggplot2
library(ggplot2)
# We create a box plot for each ISP, illustrating the ranges of their dowload speeds.
box1 = ggplot(Internet, aes(x = ISP, y=Download.Speed))
# We add information regarding the colors of the plot, and then flip it so the axes are reversed.
box2 = box1 + geom_boxplot(color='black',fill='white',
outlier.color='black',
outlier.size=2) + coord_flip()
# Here, we add a title, caption, and change the labels on each axis.
box3 = box2 + labs(title='Wave provides highest range of download speeds', subtitle = 'Stevenson, WA',
y = 'Download Speed (Megabytes/Second)',
x = NULL,
caption = 'Source: Stevenson Business and Organization Internet Survey, 2018
Note: Wave Outlier of 500 Mbps Omitteed')
# Then, we format the location, size, color, and type style of each text element.
box4 = box3 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
plot.title=(element_text(face='bold',
size=15,
hjust=0.5)),
plot.subtitle = (element_text(hjust = 0.5)),
plot.caption=element_text(size=8, vjust = -2),
text=element_text(color='gray25', size = 13),
panel.background = element_rect(fill='white'),
axis.ticks.y=element_blank(),
axis.line.x=element_line(color='black'))
# Finally, we change the scale of each axis so that we show the exact number of tick marks that we'd like.
box5 = box4 + scale_y_continuous(breaks=seq(0,100, 10),
limits=c(0,100)) + scale_x_discrete(limits=c('Wave','GorgeNet','CenturyLink'))
# Then, we "print" the vizualization.
box5
# We created a second box plot for each ISP, illustrating the ranges of their upload speeds.
box1 = ggplot(Internet, aes(x= ISP, y = Upload.Speed))
# We add information regarding the colors of the plot, and then flip it so the axes are reversed.
box2 = box1 + geom_boxplot(color='black',fill='white',
outlier.color='black',
outlier.size=2) + coord_flip()
# Here, we add a title, caption, and change the labels on each axis.
box3 = box2 + labs(title='Wave provides highest range of upload speeds', subtitle = 'Stevenson, WA',
y = 'Upload Speed (Megabytes/Second)',
x = NULL,
caption = 'Source: Stevenson Business and Organization Internet Survey, 2018')
# Then, we format the location, size, color, and type style of each text element.
box4 = box3 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
plot.title=(element_text(face='bold',
size=15,
hjust=0.5)),
plot.subtitle = (element_text(hjust = 0.5)),
plot.caption=element_text(size=8, vjust=-2),
text=element_text(color='gray25', size = 13),
panel.background = element_rect(fill='white'),
axis.ticks.y=element_blank(),
axis.line.x=element_line(color='black'))
# Finally, we change the scale of each axis so that we show the exact number of tick marks that we'd like.
box5 = box4 + scale_y_continuous(breaks=seq(0,100, 10),
limits=c(0,100)) + scale_x_discrete(limits=c('Wave','GorgeNet','CenturyLink'))
# Then, we "print" the visualization.
box5
For our bivariate plots, we decided to create scatterplots. Scatterplots use dots to represent the values for two different variables. In our case, we wanted to compare customers’ subscribed and actual speeds.
“Which ISP most accurately advertised their download speeds?” “Which ISP most accurately advertised their upload speeds?”
We can create the plots using the following steps. Additionally, notes for each of the steps are included in the code.
# For our second plots, we create a factor, ISP, with specific levels. It is based off the original variable.
ISP <- factor(Internet$ISP)
str(ISP)
## Factor w/ 6 levels "AT&T","CenturyLink",..: 2 3 6 2 3 2 1 6 6 3 ...
levels(ISP)
## [1] "AT&T" "CenturyLink" "GorgeNet" "Integra" "None"
## [6] "Wave"
# Then, we create a dataframe with just the factors that we are concerned with in these plots. It contains information for all of the ISPs.
df1 <- data.frame(ISP, Internet$Subscribed.Speed, Internet$Download.Speed, Internet$Upload.Speed)
names(df1) <- c("ISP", "Subscribed", "Download", "Upload")
# Next, we subsetted the data to only include Wave, CenturyLink, and GorgeNet.
df2 <- subset(df1, ISP %in% c("Wave", "CenturyLink", "GorgeNet"))
# We then created a base plot of Subscribed Speeds versus Upload Speeds.
base = ggplot(df2, aes(x=Subscribed, y=Upload))
# We added a scatterplot layer.
box1 = base + geom_point()
# Then, a reference line that represents the subscribed speeds.
box2 = box1 + geom_abline(a=0,b=45,lty=2,lwd=0.5, alpha = 0.5)
# We used the command facet wrap to separate the plot into three plots, one for each ISP.
box3 = box2 + facet_wrap( ~ ISP ,nrow = 1)
# Then, we added a scale for the y axis and a scale for the x axis.
box4 = box3 + scale_y_continuous(breaks=c(0,0,40,60,80,100,120),
limits = c(0, 120))
box5 = box4 + scale_x_continuous(breaks=c(0,30,60,90,120,150),
limits = c(0, 150))
# We coded for the title.
Title = 'GorgeNet Provides the Most Accurate Advertised Upload Speeds'
# We added the title, subtitle, x-axis label, y-axis label, and the caption.
box6 = box5 + labs(title=Title, subtitle = 'Stevenson, WA',
x= 'Suscribed Speed (Megabytes/Second)',
y = 'Upload Speed (Megabytes/Second)',
caption='Source: Stevenson Business and Organization Internet Survey,2018') + theme_bw()
# We formatted for each text element, as before.
box7 = box6 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
plot.title=(element_text(face='bold',
size=15,
hjust=0.5)),
plot.subtitle = (element_text(hjust = 0.5, vjust = 2)),
plot.caption=element_text(size=8, vjust=-2),
text=element_text(color='gray25', size = 11),
axis.line.y=element_line(color = 'black'),
axis.line.x=element_line(color='black'),
axis.text=element_text(size=8))
# Finally, we "printed" the faceted scatterplot.
box7
# The instructions to create the faceted scatterplot are the same as in the past plot.
base = ggplot(df2, aes(x=Subscribed, y=Download))
box1 = base + geom_point()
box2 = box1 + geom_abline(a=0,b=45,lty=2,lwd=0.5, alpha = 0.5)
box3 = box2 + facet_wrap( ~ ISP ,nrow = 1)
box4 = box3 + scale_y_continuous(breaks=seq(0,120,20),
limits = c(0,120))
box5 = box4 + scale_x_continuous(breaks=seq(0,120,40),
limits = c(0,120))
Title = 'GorgeNet Provides the Most Accurate Advertised Download Speeds'
box6 = box5 + labs(title=Title, subtitle = 'Stevenson, WA',
x= 'Suscribed Speed (Mbps)',
y = 'Download Speed (Mbps)',
caption='Source: Stevenson Business and Organization Internet Survey, 2018
Note: Wave Outlier of 500 Mbps Omitted') + theme_bw()
box7 = box6 + theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
plot.title=(element_text(face='bold',
size=15,
hjust=0.5)),
plot.subtitle = (element_text(hjust = 0.5, vjust = 2)),
plot.caption=element_text(size=8, vjust=-3),
text=element_text(color='gray25', size = 11),
axis.line.y=element_line(color = 'black'),
axis.line.x=element_line(color='black'),
axis.text=element_text(size=8),
axis.text.x = element_text(angle = 0, vjust = 0.7),
axis.title.x = element_text(vjust = -2, size = 10),
axis.title.y = element_text(size = 10, vjust = 2.5))
box7
We created two types of multivariate plots, a dot and whisker plot and a correlation plot.
Dot and Whisker Plot - This plot describes two models of regressions. The first model determines whether actual upload and download speeds affect customer’s satisfaction with speed.
Correlation Plot - This plot looks at the relationship between overall satisfaction and other explanatory variables
We can create the dot and whisker plot using the following steps. Additionally, notes for each of the steps are included in the code.
# Here, we called three libraries that would help us make dot and whisker plots as well as
library(dotwhisker)
library(broom)
library(dplyr)
# We created the first model of regression for Download and Upload Speeds on Satisfaction with Speed
model1=lm(Satisfaction.Speed~Download.Speed+Upload.Speed,data=Internet2[,-1])
summary(model1)
##
## Call:
## lm(formula = Satisfaction.Speed ~ Download.Speed + Upload.Speed,
## data = Internet2[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4111 -0.4159 0.4289 0.5825 1.5775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.395876 0.219081 15.501 2.49e-14 ***
## Download.Speed 0.001154 0.008049 0.143 0.887
## Upload.Speed 0.002786 0.007791 0.358 0.724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.043 on 25 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.1219, Adjusted R-squared: 0.05169
## F-statistic: 1.736 on 2 and 25 DF, p-value: 0.1968
# Here we tidied up and mutated the model of regression so thatwe add a column, "model," with the values of Model 1.
model1_t = tidy(model1) %>% mutate(model = "Model 1") %>%
relabel_predictors(Upload.Speed = "Upload",
Download.Speed = "Download",
Subscribed.Speed = "Subscribed")
# We then created a second model of regression, this time with a fourth variable: Subscribed Speeds.
model2=lm(Satisfaction.Speed~Download.Speed+Upload.Speed+Subscribed.Speed,data=Internet2[,-1])
summary(model2)
##
## Call:
## lm(formula = Satisfaction.Speed ~ Download.Speed + Upload.Speed +
## Subscribed.Speed, data = Internet2[, -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3054 -0.3388 0.2492 0.6909 1.6888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1851440 0.3286583 9.691 1.38e-09 ***
## Download.Speed 0.0010622 0.0098589 0.108 0.915
## Upload.Speed -0.0001521 0.0093008 -0.016 0.987
## Subscribed.Speed 0.0064991 0.0082764 0.785 0.440
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.069 on 23 degrees of freedom
## (13 observations deleted due to missingness)
## Multiple R-squared: 0.145, Adjusted R-squared: 0.03353
## F-statistic: 1.301 on 3 and 23 DF, p-value: 0.2981
# We tidied up and mutated the second model of regression to have a column called "model," with values of "Model 2"
model2_t <- tidy(model2) %>% mutate(model = "Model 2") %>%
relabel_predictors(Upload.Speed = "Upload",
Download.Speed = "Download",
Subscribed.Speed = "Subscribed")
# Here we combined all the models
allModels=rbind(model1_t, model2_t)
# Here we plot the coefficients and their confidence intervals for all models and add or change theme/text elements
dwplot(allModels) +
geom_vline(xintercept = 0,
colour = "grey 50",
linetype = 2) +
scale_colour_grey(start = .1, end = .7) +
labs(y = "Independent Variables",
title = "Relationship Between Speed Satisfaction &
Download, Upload, and Subscribed Speed",
caption = "Source:Stevenson Business and Organization Internet Survey, 2018") +
theme(plot.margin=unit(c(1,1,.5,.5),'cm'),
plot.title=(element_text(face='bold',
size=15,
hjust=0.5)),
plot.subtitle = (element_text(hjust = 0.5, vjust = 2)),
plot.caption=element_text(size=9, vjust=-2),
text=element_text(color='black', size = 11),
panel.background = element_rect(fill='white'),
legend.title = element_blank(),
axis.line.y=element_line(color = 'black'),
axis.line.x=element_line(color='black'),
axis.text=element_text(size=8))
Because all of the models’ confidence intervals overlapped with the value of 0 on the x-axis, neither model identifies a relationship between experienced speeds or subscribed speed and customers’ satisfaction with their speed.
Now that we’ve seen that broadband speeds are not necessarily correlated with download speed, we can look at other explanatory variables for relationships. A good way to do this is through a correlation plot.
#The first step is to create a new dataframe with the variables of interest
#Adding data to dataframe and removing outlier observation in row 3
df1 <- Internet2[-3,]
#Adding relevant variables to new dataframe
df.corr <- df1[,c('ISP','Cost','Subscribed.Speed','Download.Speed','Upload.Speed', 'Satisfaction.Speed', 'Satisfaction.Price','Satisfaction.Reliability','Satisfaction.Customer.Service','Satisfaction.Overall','Ease.Streaming','Ease.Downloading','Speed.Difficulty')]
#We need to convert the values for satisfaction ('Dissatisfied','Satisfied', etc.), cost, and ease ('Easy','Difficult', etc.) into numeric factors so we can plot them compare them to each other
df.corr$Cost <- as.integer(as.factor(df.corr$Cost))
df.corr$Satisfaction.Price <- as.integer(as.factor(df.corr$Satisfaction.Price))
df.corr$Satisfaction.Reliability <- as.integer(as.factor(df.corr$Satisfaction.Reliability))
df.corr$Satisfaction.Customer.Service <- as.integer(as.factor(df.corr$Satisfaction.Customer.Service))
df.corr$Satisfaction.Overall <- as.integer(as.factor(df.corr$Satisfaction.Overall))
df.corr$Ease.Streaming <- as.integer(as.factor(df.corr$Ease.Streaming))
df.corr$Ease.Downloading <- as.integer(as.factor(df.corr$Ease.Downloading))
df.corr$Speed.Difficulty <- as.integer(as.factor(df.corr$Speed.Difficulty))
Now we can plot this new dataframe (df.corr)
#Calling up the appropriate library
library(GGally)
#PVisaulizing onto a correlation plot
corrplot = ggcorr(df.corr[,-1], # all but the first column
hjust = 0.9,# distance to plot (diagonal)
size=3, # font size
layout.exp=3, # width so that variable names are shown
low = 'red',high = 'blue') # color scale
#Adding titles
corrplot1 = corrplot + labs(title='Correlation Plot for Broadband Satisfaction',
caption='Source: Stevenson Business and Organization Internet Survey, 2018')
#Adjusting the theme
corrplot2 = corrplot1 + theme(plot.title=(element_text(face='bold', #Bold title
size=15, #Font size
hjust=1)), #title Horizontal alignment
plot.caption=(element_text(hjust=1))) #Caption horizontal alignment
corrplot2
Here, higher positive correlation will apear more blue (1) while negative correlations will appear more organe (-1). While this plot is helpful, the multiple variables that measure satisfaction adds confusion. Instead, we can look at just ‘Overall Satisfaction’ and see if it is associated with any of the explanatory variables.
#Creating a new dataframe with only one variable measuring satisfaction
df.corr2 <- df.corr[,c('ISP','Satisfaction.Overall','Cost','Subscribed.Speed','Download.Speed','Upload.Speed','Ease.Streaming','Ease.Downloading')]
#Same steps as the previous plot, but with the new dataframe (df.corr2)
corrplot3 = ggcorr(df.corr2[,-1],
hjust = 0.9,
layout.exp=3,
size=3,
low = 'red',high = 'blue')
corrplot4 = corrplot3 + labs(title='Correlation Plot for Broadband Satisfaction',
subtitle='Overall Satisfaction Only',
caption='Source: Stevenson Business and Organization Internet Survey, 2018')
corrplot5 = corrplot4 + theme(plot.title=(element_text(face='bold',
size=15,
hjust=1)),
plot.subtitle=(element_text(hjust=.6)),
plot.caption=(element_text(hjust=1)))
corrplot5
This plot tells us more about the relationship between Satisfaction and the other variables. The cell with the largest negative correlation is the interaction between ‘Satisfaction.Overall’ and ‘Cost’; this indicates that people who pay more for their internet service tend to be less satisfied. The greatest positive correlation is between ‘Satisfaction.Overall’ and ‘Ease.Downloading’, which indicates that people who rate the difficulty of downloading lower are more satisfied overall with their ISP.
How does fixed broadband service vary across the state of Washington?
Below, we create a map of broadband service speeds in Washington, at the Census block code unit of analysis.